Pose estimation based on critical point analysis

ABSTRACT

Methods and systems for estimating a pose of a subject. The subject can be a human, an animal, a robot, or the like. A system includes a camera to receive depth information associated with a subject, a pose estimation module to determine a pose or action of the subject from images, and an interaction module to output a response to the perceived pose or action. The pose estimation module separates portions of the image containing the subject into classified and unclassified portions. The portions can be segmented using k-means clustering. The classified portions can be known objects, such as a head and a torso, that are tracked across the images. The unclassified portions are swept across an x and y axis to identify local minimums and local maximums. The critical points are derived from the local minimums and local maximums. Potential joint sections are identified by connecting various critical points, and the joint sections having sufficient probability of corresponding to an object on the subject are selected.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application No. 60/663,020, entitled “Pose Estimation Based on Critical Point Analysis,” filed on Mar. 17, 2005, now abandoned, the subject matter of which is incorporated by reference herein in its entirety, and to co-pending U.S. Provisional Patent Application No. 60/738,413, entitled “Estimating Pose Sequences from Depth Image Streams,” filed on Nov. 17, 2005, the subject matter of which is incorporated by reference herein in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to machine vision, and more specifically, to machine-based estimation of poses using critical point analysis.

2. Description of the Related Art

Conventional techniques for machine vision allow a robot or other machine to recognize objects. The objects can be recognized for navigation around the objects, retrieval of the objects, or other purposes. Conventional techniques for pose estimation detect various objects of a subject such as body parts of a human. Additionally, pose estimation can determine an orientation of the body part.

One problem with conventional techniques for pose estimation is the complexity and expense of equipment needed to capture image information. For example, a 3D camera system typically requires that the subject be confined to a room in which the cameras are configured. The 3D camera system is also very expensive. A marker system allows known points of a subject to be marked and followed throughout motions. However, the subject has to be prepared ahead of time, and be cooperative with observation.

Therefore, what is needed is a method or system for estimating poses of a subject without the expense and complexity of conventional techniques.

SUMMARY

The present invention provides methods and systems for estimating a pose of a subject based on critical point analysis. In one embodiment, a system includes a camera to receive depth information associated with a subject, a pose estimation module to determine a pose or action of the subject from images, and an interaction module to output a response to the perceived pose or action.

In one embodiment, the pose estimation module separates portions of the image containing the subject into classified and unclassified portions. The portions can be segmented using k-means clustering. The classified portions can be known objects, such as a head and a torso, that are tracked across the images. The unclassified portions are swept across an x and y axis to identify local minimums and local maximums. The critical points are derived from the local minimums and local maximums. Potential joint sections are identified by connecting various critical points, and the joint sections having sufficient probability of corresponding to an object on the subject are selected.

In another embodiment, the pose estimation module comprises an estimation module to select the joint section from the potential sections. The estimation module implements various rules to calculate probabilities associated with the potential joint sections. For example, a joint section can be evaluated for how many of its pixels are commensurate with pixels of the subject. Additional rules are discussed herein.

Advantageously, the system can take visual cues from a human to perform certain actions (e.g., go left, speed up, or stop). Furthermore, the system can observe the activities of humans, animals, robots, or other subjects for logging or other purposes.

The features and advantages described herein are not all inclusive, and, in particular, many additional features and advantages will be apparent to one skilled in the art in view of the drawings, specifications, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes and may not have been selected to circumscribe the claimed invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings. Like reference numerals are used for like elements in the accompanying drawings.

FIG. 1 is a block diagram of a system for estimating a pose of a subject according to one embodiment of the present invention.

FIG. 2 is a block diagram of a pose estimation module of the system according to one embodiment of the present invention.

FIG. 3 is a flow chart illustrating a method for estimating the pose of the subject according to one embodiment of the present invention.

FIG. 4 is a flow chart illustrating a method for identifying critical points according to one embodiment of the present invention.

FIG. 5 is a flow chart illustrating a method for generating a skeletal structure according to one embodiment of the present invention.

FIG. 6A is a schematic diagram showing a conventional 2-D image of a human subject, while FIG. 6B shows FIG. 6A with depth information in accordance with one embodiment of the present invention.

FIG. 7 is a schematic diagram of horizontal and vertical sweeps according to one embodiment of the present invention.

FIG. 8A is a schematic diagram showing a depth image, while FIG. 8B shows FIG. 8A with classified and unclassified portions in accordance with one embodiment of the present invention.

FIG. 9 is a schematic diagram of a library of preloaded poses according to one embodiment of the present invention.

FIG. 10 is a schematic diagram of a subject in an image and axes used for sweeping the subject for local minimums and local maximums according to one embodiment of the present invention.

FIG. 11 is a schematic diagram of a skeletal structure generated from the subject image of FIG. 10 according to one embodiment of the present invention.

The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION OF EMBODIMENTS

Methods and systems for estimating a position of a subject are described. The subject can be, for example, a human, an animal, or a robot. As the subject is in motion, or is performing various actions, the subject holds different poses over time. In one embodiment, critical point analysis is employed to determine a position of the subject at a certain time. By tracking positions over time, the motion or action of the subject can be determined. For example, a robot can react to visual cues of a human such as go left, speed up, and stop. FIGS. 1 and 2 show an exemplary system implementing a method for pose estimation, and FIGS. 3-6 show an exemplary method for pose estimation. One of ordinary skill in the art will understand that, given the description herein, additional embodiments are possible.

FIG. 1 is a block diagram illustrating a system 100 for pose estimation according to one embodiment of the present invention. System 100 comprises a depth camera 110, a pose estimation module 120, an interaction module 130, and a processor 140. These components can be coupled in communication through, for example, software APIs, a data bus, an input/output controller, processor 140, and the like. System 100 can be a robot or other machine that interacts with or observes humans. Methods implemented within system 100 are described below.

Camera 110 receives image data, and sends a stream of image frames to pose estimation module 120. Camera 110 can be, for example, a pulse-based camera (e.g., manufactured by 3DV Systems, Inc. of Portland, Oreg.) or a modulated light camera (e.g., manufactured by Canesta, Inc. of Sunnyvale, Calif.). In one embodiment, camera 110 captures an image of the subject including depth information. The depth information describes a distance from the camera to different portions of the subject. For example, each pixel can include a depth value in addition to traditional values such as contrast, hue, and brightness.

Pose estimation module 120 receives the stream of image frames, and sends an estimated pose to interaction module 130. Pose estimation module 120 and interaction module 130 can be implemented in hardware and/or software. One embodiment of pose estimation module 120 is described in further detail below with respect to FIG. 2. In one embodiment, pose estimation module 120 uses critical point analysis to determine a pose of the subject in each of the image frames. In another embodiment, interaction module 130 tracks poses temporally across the media stream in order to determine an action of the subject. For example, a pose of a finger pointing can indicate a direction, but a finger wagging motion can indicate a degree of rotation. In response to determining an action, interaction module 130 can cause an action by system 100 such as moving in a direction, or rotating from a position.

FIG. 2 shows pose estimation module 120 in greater detail. Pose estimation module 120 comprises a critical point module 210, a skeletal generation module 220, and an estimator module 230.

Critical point module 210 can identify critical points for local minimums and maximums of the subject area. In one embodiment, critical point module 210 performs an x-sweep, or scans in both directions along the x-axis. In addition, critical point module 210 can perform a y-sweep and a z-sweep. A local minimum or maximum refers to, with respect to a particular portion of the subject, an uppermost, lowermost, leftmost, or rightmost point.
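
By way of illustration only, an x-sweep over a silhouette can be sketched as follows. The sketch assumes the subject area is given as a binary mask (rows of 0 and 1), reduces each column to the height of its uppermost subject pixel, and reports interior local minimums and maximums of that profile. The function names `column_heights` and `sweep_extrema`, the mask representation, and the plateau handling are assumptions made for this example and are not taken from the description.

```python
def column_heights(mask):
    """Height of the uppermost subject pixel in each column (0 if the column is empty).

    mask is a list of rows; row 0 is the top of the image (assumed layout).
    """
    rows, cols = len(mask), len(mask[0])
    heights = []
    for x in range(cols):
        top = next((y for y in range(rows) if mask[y][x]), None)
        heights.append(0 if top is None else rows - top)
    return heights


def sweep_extrema(heights):
    """Return (local_minimums, local_maximums) as lists of column indices.

    Flat runs are collapsed first so that a plateau yields a single
    extremum at its middle column. Endpoints are not reported.
    """
    runs = []  # each run is (first_index, last_index, value)
    for i, h in enumerate(heights):
        if runs and runs[-1][2] == h:
            runs[-1] = (runs[-1][0], i, h)
        else:
            runs.append((i, i, h))

    minimums, maximums = [], []
    for k in range(1, len(runs) - 1):
        first, last, value = runs[k]
        left, right = runs[k - 1][2], runs[k + 1][2]
        middle = (first + last) // 2
        if value > left and value > right:
            maximums.append(middle)
        elif value < left and value < right:
            minimums.append(middle)
    return minimums, maximums


# Hypothetical 3x5 silhouette used only to exercise the sweep.
example_mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
example_minimums, example_maximums = sweep_extrema(column_heights(example_mask))
```

A y-sweep can be obtained in the same manner by transposing the mask before computing the profile.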

Skeletal generation module 220 can generate a skeletal structure for the subject from joint positions. In one embodiment, skeletal generation module 220 forms joint positions by connecting critical points within the subject area. Skeletal generation module 220 can implement a set of rules during the process of finding joint positions and the skeletal structure. For example, one rule can require that joint positions remain within the subject area. Another rule can require that joint positions span across the center of a local portion of the subject area. Still another rule can require that a logical human configuration is maintained. These rules are discussed in more detail below.

Estimator module 230 can determine a pose of the subject based on the skeletal structure. In one embodiment, estimator module 230 can use posture criteria to calculate a probability that the skeletal structure matches a pose. Estimator module 230 can be preloaded with a library of poses, and a library of associated skeletal structures as shown in FIG. 9.

FIG. 3 is a flow chart illustrating a method 300 for estimating a pose of a subject according to one embodiment of the present invention. The method 300 can be implemented in a computer system (e.g., system 100).

A camera (e.g., camera 110) receives 310 an image, including depth information associated with the subject. The depth information provides a distance between the camera and portions of the subject (e.g., with respect to each pixel or group of pixels). The depth information allows the image to be segregated based not only on horizontal and vertical axes, but also based on a depth axis. To illustrate depth information, FIG. 6A shows a figure of a human subject while FIG. 6B shows FIG. 6A with depth information. A pulse-based camera sends out a pulse of illumination which echoes off objects, and measures the echo. A modulated light camera emits light in a sine wave, and measures an amplitude and phase shift of the returned wave.
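
For illustration, the two measurement principles can be related to distance with the standard time-of-flight relations: a round-trip echo delay t corresponds to a distance c·t/2, and a phase shift Δφ of light modulated at frequency f corresponds to a distance c·Δφ/(4πf). The description does not state these relations or any camera parameters; the function names and the 20 MHz modulation frequency below are assumptions chosen only for the example.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def pulse_to_depth(round_trip_seconds):
    """Distance for a pulse-based camera: the light travels out and back."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


def phase_to_depth(phase_shift_radians, modulation_hz):
    """Distance for a modulated light camera from the measured phase shift."""
    return SPEED_OF_LIGHT * phase_shift_radians / (4.0 * math.pi * modulation_hz)


def unambiguous_range(modulation_hz):
    """Largest distance before the phase wraps past 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * modulation_hz)


# Assumed example values: a half-cycle phase shift at 20 MHz modulation.
example_depth = phase_to_depth(math.pi, 20e6)
```

Under these assumed values the phase shift of π corresponds to half of the unambiguous range.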

A critical points module (e.g., critical point module 210) identifies 320 critical points from local minimums and local maximums as shown in FIG. 4. An image is spatially partitioned 410 using, for example, k-means clustering. In some cases, neighboring clusters can be combined. Resulting partitions are classified 420 if possible. A partition is classified by identifying the partition as a known object, such as a head or torso when the subject is a human. In one embodiment, once an object is classified, it can be tracked across subsequent images for efficiency. The critical points module applies critical point analysis to the unclassified partitions of the subject. More specifically, the critical points module sweeps 430 a cross-section along the axes. The local minimums and local maximums revealed in the sweeps form 440 the critical points. For example, as shown in FIG. 7, a cross-section 710 that is swept vertically across a 3D object 712 reveals local minimum 714, and local maximums 716 a-c. Furthermore, when cross-section 710 is swept horizontally across 3D object 712, local minimum 718 and local maximum 720 are revealed.
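
The spatial partitioning step can be sketched, under stated assumptions, with a plain Lloyd-style k-means over (x, y, depth) samples. The description names k-means clustering but gives no feature vector, initialization, or stopping rule; the choice of features, the caller-supplied initial centers, and the names `nearest` and `kmeans` are assumptions for this example only.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )


def kmeans(points, initial_centers, max_iterations=20):
    """Partition points into len(initial_centers) clusters.

    Returns (centers, labels), where labels[i] is the cluster of points[i].
    """
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    for _ in range(max_iterations):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's center in place
        if updated == centers:
            break
        centers = updated
    return centers, [nearest(p, centers) for p in points]


# Hypothetical samples: a near group at depth 1.0 and a far group at depth 2.0.
samples = [(0, 0, 1.0), (1, 0, 1.0), (0, 1, 1.0),
           (10, 10, 2.0), (11, 10, 2.0), (10, 11, 2.0)]
centers, labels = kmeans(samples, [samples[0], samples[3]])
```

Each resulting label identifies a partition that could then be classified as a known object or left unclassified for the sweeps.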

Referring again to FIG. 3, the skeletal generation module (e.g., skeletal generation module 220) determines 330 a pose of the subject based on the skeletal structure as shown in FIG. 5. A set of potential joint sections are identified 510 by connecting selected critical points. The potential joint sections are tested 520 using, for example, the posture criteria. The posture criteria represent a probability that the joint section is associated with an actual object on the subject. The joint section having the highest probability is selected 530. The classified portions of the subject area and joint sections in unclassified portions of the subject area are combined 540 to form the skeletal structure. The skeletal structure is compared 550 to preloaded skeletal structures to determine the pose. Returning to FIG. 7, skeletal structure 722 is a result of the vertical scan, and skeletal structure 724 is a result of the horizontal scan, both of which identified joint sections from critical points. In addition, FIG. 11 is a skeletal structure resulting from a subject shown in FIG. 10.
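
The comparison against preloaded skeletal structures can be sketched as a nearest-match lookup. The description does not specify the comparison metric; the sum of squared joint distances used here, the dictionary representation of a skeleton, the requirement that all skeletons share the same joint names, and the pose names are all assumptions for illustration.

```python
def skeleton_distance(observed, reference):
    """Sum of squared 2-D distances between identically named joints.

    Both skeletons are assumed to map the same joint names to (x, y) tuples.
    """
    return sum(
        (observed[joint][0] - reference[joint][0]) ** 2
        + (observed[joint][1] - reference[joint][1]) ** 2
        for joint in observed
    )


def match_pose(observed, library):
    """Name of the preloaded skeleton closest to the observed skeleton."""
    return min(library, key=lambda name: skeleton_distance(observed, library[name]))


# Hypothetical library of one-arm skeletons, expressed relative to the shoulder.
pose_library = {
    "arm_down": {"shoulder": (0, 0), "elbow": (0, -1), "hand": (0, -2)},
    "arm_raised": {"shoulder": (0, 0), "elbow": (0, 1), "hand": (0, 2)},
    "arm_out": {"shoulder": (0, 0), "elbow": (1, 0), "hand": (2, 0)},
}
observed_skeleton = {"shoulder": (0, 0), "elbow": (0.9, 0.1), "hand": (1.8, 0.3)}
best_pose = match_pose(observed_skeleton, pose_library)
```

Expressing joints relative to a reference joint, as assumed here, keeps the comparison independent of where the subject stands in the image.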

The posture criteria can be calculated using the following formula:

$P(h) = F_{1}(h)\,F_{2}(h)\,F_{3}(h)\,F_{4}(h)\,F_{5}(h)$

where Fi corresponds to the i-th constraint defined in the posture criteria. In other words, each rule can be represented as a probability between 0 and 1. The product of the probabilities is used to compare the potential joint sections. The rules can be expressed by the following formulas:

$F_{1}(h) = \lambda e^{-\lambda x}$

where F1(h) represents the number of pixels that are outside of the blob;

$F_{2}(h) = \prod\limits_{i = 1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_{1}} e^{-\frac{s_{i}^{2}}{2\sigma_{1}^{2}}}$

where F2(h) represents how close the joint section is to the local center, or the center of the partition;

$F_{3}(h) = \prod\limits_{i = 1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma_{2}} e^{-\frac{f_{i}^{2}}{2\sigma_{2}^{2}}}$

where F3(h) represents how close the critical points are to the joint section;

$F_{4}(h) = \frac{1}{\sqrt{2\pi}\,\sigma_{3}} e^{-\frac{\left(\min\left(DT\_Hand - DT\_Elbow,\, 0\right)\right)^{2}}{2\sigma_{3}^{2}}}$

where F4(h) ensures a proper sequence of joint segments (e.g., that a hand connects to an arm, and that a hand does not connect to a foot);

$F_{5}(h) = \frac{1}{\sqrt{2\pi}\,\sigma_{4}} e^{-\frac{\left(Hand^{t-1} - Hand^{t}\right)^{2} + \left(Elbow^{t-1} - Elbow^{t}\right)^{2}}{2\sigma_{4}^{2}}}$

where F5(h) ensures that temporal continuity is preserved between images. Furthermore, x is the number of points located within the subject area being analyzed (e.g., an arm); s is the distance of a sampled skeleton point to the subject area; f is the distance of a critical point to the subject area; DT_Hand is the distance transformed value of a point (e.g., on a hand); DT_Elbow is the distance transformed value of another point (e.g., on an elbow); Hand is the hand point and Elbow is the elbow point in this example; λ is a Poisson distribution parameter; and σ is a Gaussian distribution parameter. Note that alternative formulations of posture criteria are possible.
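
As a minimal sketch, the product of the five terms can be evaluated for each potential joint section and the section with the largest value retained. The sketch assumes that every Gaussian term carries a negative exponent and that F4 uses σ3 throughout. The `Hypothesis` container, the function names, and the default parameter values (λ and all σ set to 1) are hypothetical and chosen only so the example runs.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Measurements for one potential joint section (illustrative container)."""
    x: float            # pixel count used by F1
    s: list             # distances of sampled skeleton points, used by F2
    f: list             # distances of critical points, used by F3
    dt_hand: float      # distance transformed value at the hand point
    dt_elbow: float     # distance transformed value at the elbow point
    hand_prev: tuple    # hand point in the previous image
    hand: tuple         # hand point in the current image
    elbow_prev: tuple   # elbow point in the previous image
    elbow: tuple        # elbow point in the current image


def gaussian(squared_deviation, sigma):
    """Zero-mean Gaussian density evaluated from a squared deviation."""
    return math.exp(-squared_deviation / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def posture_probability(h, lam=1.0, sigma1=1.0, sigma2=1.0, sigma3=1.0, sigma4=1.0):
    """P(h) = F1(h) F2(h) F3(h) F4(h) F5(h) under the assumptions stated above."""
    f1 = lam * math.exp(-lam * h.x)
    f2 = math.prod(gaussian(s * s, sigma1) for s in h.s)
    f3 = math.prod(gaussian(f * f, sigma2) for f in h.f)
    f4 = gaussian(min(h.dt_hand - h.dt_elbow, 0.0) ** 2, sigma3)
    f5 = gaussian(
        squared_distance(h.hand_prev, h.hand) + squared_distance(h.elbow_prev, h.elbow),
        sigma4,
    )
    return f1 * f2 * f3 * f4 * f5


def select_joint_section(hypotheses):
    """Return the hypothesis with the highest posture probability."""
    return max(hypotheses, key=posture_probability)


# Two hypothetical candidates: one well centered and steady, one poorly placed.
steady = Hypothesis(0.0, [0.1, 0.2], [0.1], 3.0, 3.0, (5, 5), (5, 5), (3, 4), (3, 4))
erratic = Hypothesis(4.0, [1.5, 2.0], [1.0], 1.0, 3.0, (5, 5), (9, 9), (3, 4), (6, 8))
chosen = select_joint_section([steady, erratic])
```

Because the terms are multiplied, a single strongly violated rule is enough to suppress a candidate, which is the behaviour the product form suggests.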

Referring again to FIG. 3, an interaction module (e.g., interaction module 130) responds to the estimated pose. In one embodiment, the estimated poses can be considered temporally to determine an action. The interaction module can output an action that is responsive to the estimated pose or the action.
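
One way to consider estimated poses temporally, offered only as a sketch, is to collapse repeated per-frame poses and look for a known pose sequence. The description does not define how actions are recognized; the pose labels, the pattern table, and the names `collapse` and `detect_action` are hypothetical.

```python
from itertools import groupby


def collapse(poses):
    """Remove consecutive repeats so each held pose appears once."""
    return [pose for pose, _ in groupby(poses)]


def detect_action(poses, patterns):
    """Return the first action whose pose pattern occurs in the sequence, else None."""
    sequence = collapse(poses)
    for action, pattern in patterns.items():
        n = len(pattern)
        for start in range(len(sequence) - n + 1):
            if sequence[start:start + n] == list(pattern):
                return action
    return None


# Hypothetical pattern: a wag is pointing left, then right, then left again.
action_patterns = {"wag": ("point_left", "point_right", "point_left")}
frame_poses = ["idle", "idle", "point_left", "point_left",
               "point_right", "point_left", "idle"]
detected = detect_action(frame_poses, action_patterns)
```

A single held pose, such as a steady point to the left, produces no match against the wag pattern and could instead be handled as a pose in its own right.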

The order in which the steps of the methods of the present invention are performed is purely illustrative in nature. The steps can be performed in any order or in parallel, unless otherwise indicated by the present disclosure. The methods of the present invention may be performed in hardware, firmware, software, or any combination thereof operating on a single computer or multiple computers of any type. Software embodying the present invention may comprise computer instructions in any form (e.g., source code, object code, interpreted code, etc.) stored in any computer-readable storage medium (e.g., a ROM, a RAM, a magnetic media, a compact disc, a DVD, etc.). Such software may also be in the form of an electrical data signal embodied in a carrier wave propagating on a conductive medium or in the form of light pulses that propagate through an optical fiber.

While particular embodiments of the present invention have been shown and described, it will be apparent to those skilled in the art that changes and modifications may be made without departing from this invention in its broader aspect and, therefore, the appended claims are to encompass within their scope all such changes and modifications, as fall within the true spirit of this invention.

In the above description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.

Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The present invention also relates to an apparatus for performing the operations herein. This apparatus can be specially constructed for the required purposes, or it can comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program can be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.

The algorithms and modules presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems can be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages can be used to implement the teachings of the invention as described herein. Furthermore, as will be apparent to one of ordinary skill in the relevant art, the modules, features, attributes, methodologies, and other aspects of the invention can be implemented as software, hardware, firmware or any combination of the three. Of course, wherever a component of the present invention is implemented as software, the component can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future to those of skill in the art of computer programming. Additionally, the present invention is in no way limited to implementation in any specific operating system or environment.

It will be understood by those skilled in the relevant art that the above-described implementations are merely exemplary, and many changes can be made without departing from the true spirit and scope of the present invention. Therefore, it is intended by the appended claims to cover all such changes and modifications that come within the true spirit and scope of this invention.

1. A method for estimating a pose of a subject in an image, comprising: receiving the image including depth information associated with the subject; classifying one or more parts of the image as known parts of the subject; identifying critical points representing local horizontal minimums, local horizontal maximums, local vertical minimums, and local vertical maximums of unclassified parts of the image from the depth information; generating a skeletal structure for the subject from joint positions, the joint positions formed by connecting critical points within the unclassified parts of the image; and determining a pose of the subject based on the skeletal structure.
2. The method of claim 1, wherein the step of generating the skeletal structure comprises: determining a set of possible joint positions by identifying pairs of critical points that, when connected, remain within the unclassified parts of the image.
3. The method of claim 1, wherein the step of generating the skeletal structure comprises: determining a set of possible joint positions by identifying pairs of critical points that, when connected, span closest to a center of the unclassified parts of the image.
4. The method of claim 1, wherein the step of generating the skeletal structure comprises: determining a set of possible joint positions by identifying pairs of critical points that, when connected, preserve a known human configuration.
5. The method of claim 1, wherein the step of generating the skeletal structure comprises: receiving a subsequent image; and determining a set of possible joint positions by identifying pairs of critical points that, when connected, preserve continuity between the images.
6. The method of claim 1, further comprising: spatially partitioning the image with k-means clustering, wherein the local minimums and maximums correspond to minimums and maximums within spatial partitions.
7. The method of claim 1, wherein the step of generating the skeletal structure comprises: receiving a subsequent image; identifying subsequent critical points; and generating a subsequent skeletal image by comparing the critical points against the subsequent critical points.
8. A computer-readable medium storing a computer program product configured to perform a method for estimating a pose of a subject in an image, the method comprising: receiving the image including depth information associated with the subject; classifying one or more parts of the image as known parts of the subject; identifying critical points representing local horizontal minimums, local horizontal maximums, local vertical minimums, and local vertical maximums of unclassified parts of the image from the depth information; generating a skeletal structure for the subject from joint positions, the joint positions formed by connecting critical points within the unclassified parts of the image; and determining a pose of the subject based on the skeletal structure.
9. The computer program product of claim 8, wherein the step of generating the skeletal structure comprises: determining a set of possible joint positions by identifying pairs of critical points that, when connected, remain within the unclassified parts of the image.
10. The computer program product of claim 8, wherein the step of generating the skeletal structure comprises: determining a set of possible joint positions by identifying pairs of critical points that, when connected, span closest to a center of the unclassified parts of the image.
11. The computer program product of claim 8, wherein the step of generating the skeletal structure comprises: determining a set of possible joint positions by identifying pairs of critical points that, when connected, preserve a known human configuration.
12. The computer program product of claim 8, wherein the step of generating the skeletal structure comprises: receiving a subsequent image; and determining a set of possible joint positions by identifying pairs of critical points that, when connected, preserve continuity between the images.
13. The computer program product of claim 8, further comprising: spatially partitioning the unclassified parts of the image with k-means clustering, wherein the local minimums and maximums correspond to minimums and maximums within spatial partitions.
14. The computer program product of claim 8, wherein the step of generating the skeletal structure comprises: receiving a subsequent image; identifying subsequent critical points; and generating a subsequent skeletal image by comparing the critical points against the subsequent critical points.
15. A system for estimating a pose of a subject in an image, comprising: an input to receive the image including depth information associated with the subject; a critical points module, coupled in communication with the input, the critical points module configured to classify one or more parts of the image as known parts of the subject and identify critical points representing local horizontal minimums, local horizontal maximums, local vertical minimums, and local vertical maximums of unclassified parts of the image from the depth information; a skeletal generation module, coupled in communication with the critical points module, the skeletal generation module configured to form a skeletal structure for the subject from joint positions, the joint positions formed by connecting critical points within the unclassified parts of the image; and an estimation module, coupled in communication with the skeletal generation module, the estimation module configured to determine a pose of the subject based on the skeletal structure.
16. The system of claim 15, wherein the skeletal generation module determines a set of possible joint positions by identifying pairs of critical points that, when connected, remain within the unclassified parts of the image.
17. The system of claim 15, wherein the skeletal generation module determines a set of possible joint positions by identifying pairs of critical points that, when connected, span closest to a center of the unclassified parts of the image.
18. The system of claim 15, wherein the skeletal generation module determines a set of possible joint positions by identifying pairs of critical points that, when connected, preserve a known human configuration.
19. The system of claim 15, wherein the input receives a subsequent image, and the skeletal generation module determines a set of possible joint positions by identifying pairs of critical points that, when connected, preserve continuity between the images.
20. The system of claim 15, wherein the critical points module spatially partitions the image with k-means clustering, wherein the local minimums and maximums correspond to minimums and maximums within spatial partitions.
21. The system of claim 15, wherein the input receives a subsequent image, the critical points module identifies subsequent critical points, and the skeletal generation module generates a subsequent skeletal image by comparing the critical points against the subsequent critical points.